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Abstract 


We present two new results for the problem of approximating a given real m x n matrix A 
by a rank-k matrix D, where k < min{m,n}, so as to minimize ||A — D||%. It is known that by 
sampling O(k/e) rows of the matrix, one can find a low-rank approximation with additive error 
e||A||%. Our first result shows that with adaptive sampling in t rounds and O(k/e) samples in 
each round, the additive error drops exponentially as €¢; the computation time is nearly linear in 
the number of nonzero entries. This demonstrates that multiple passes can be highly beneficial 
for a natural (and widely studied) algorithmic problem. Our second result is that there exists a 
subset of O(k?/e) rows such that their span contains a rank-k approximation with multiplicative 
(1 + €) error (i.e., the sum of squares distance has a small “core-set” whose span determines 
a good approximation). This existence theorem leads to a PTAS for the following projective 
clustering problem: Given a set of points P in R%, and integers k,j, find a set of 7 subspaces 
F,..., Fj, each of dimension at most k, that minimize )7- p min; d(p, F;)?. 


1 Introduction 


Given data consisting of points in high-dimensional space, it is often of interest to find a low- 
dimensional representation. In this paper, we consider the general problem of finding one or more 
(up to 7) subspaces, each of dimension at most k, and representing each point by its orthogonal 
projection to the nearest subspace. Our goal will be to minimize the sum of squared distances of 
each point to its nearest subspace, a measure of the “error” incurred by this representation. 

This problem has been called projective clustering, since the 7 subspaces induce a partition of 
the point set. Algorithms and systems based on projective clustering have been applied to facial 
recognition, data-mining, and synthetic data [6], motivated by the observation that no single 
subspace performs as well as a few different subspaces. It should be noted that the advantage of a 
low-dimensional representation is not merely in the computational savings, but also the improved 
quality of retrieval. We discuss related theoretical work in Section 

The case of 7 = 1, i.e., finding a single k-dimensional subspace is an important problem in itself 
and can be solved efficiently (for 7 > 2, the problem is NP-hard [23], even for k = 1 [I1]). Viewing 
the points as the rows of an m x n matrix A, we find the top k right singular vectors of this matrix 
via the Singular Value Decomposition (SVD). The projection itself is given by the rank k matrix 
Ay = AYY™ where the columns of Y are the top k right singular vectors of A. Note that among all 
rank k matrices D, Ax is the one that minimizes ||A—D||?, = ig Aig — Dj;)*. The running time of 
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this algorithm, dominated by the SVD computation, is O(min{mn?,nm?}). Although polynomial, 
this is still too high for some applications. 

For problems on data sets that are too large or expensive to store/process in their entirety, one 
can view the data as a stream and the goal is to store/process a subset chosen judiciously on the 
fly and then extrapolate from this subset. Motivated by the question of finding a faster algorithm, 
Frieze et al. [16] showed that any matrix A has a subset of k/e rows whose span contains an 
approximately optimal rank k approximation to A. In fact, the subset of rows can be obtained as 
independent samples from a distribution that depends only on the lengths of the rows. (In what 
follows, A“® denotes the ith row of A, as a column vector.) 


Theorem 1 ({16]). Let S be a sample of s rows of anm xn matrix A, each chosen independently 
from the following distribution: Row i is picked with probability 
, A412 
72 C€ : 
Alle 


If s > k/ce, then the span of S contains a matriz Ay, of rank at most k for which 
[A — Aall® < A — Anll® + €llAllé. 


This can be turned into an efficient algorithm based on sampling The algorithm makes one 
pass through A to figure out the sampling distribution and another pass to sample and compute the 
approximation. Its complexity is O(min{m, n}k?/e4). These results lead to the following questions: 
(1) Can the error be reduced significantly by using multiple passes through the data? (2) Can we 
get multiplicative (1+) approximations? (3) Do these sampling algorithms have any consequences 
for the general projective clustering problem? 


1.1 Our results 


Our first result is that the additive error term drops exponentially with the number of passes. Thus, 
low-rank approximation is a natural problem for which multiple passes through the data are highly 
beneficial. 

The idea behind the algorithm is quite simple. As an illustrative example, suppose the data 
consists of points along a 1-dimensional subspace of R” except for one point. The best rank 2 
subspace has zero error. However, one round of sampling will most likely miss the point far from 
the line. So we consider the following two-round approach. In the first pass, we get a sample 
and find a rank 2 approximation using it. Then we sample again, but this time with probability 
proportional to the error of the approximation. If the lone far-off point is missed in the first pass, 
it will have a very high probability of being chosen in the second pass. The span of the full sample 
now contains a very good rank 2 approximation. In the general theorem below, for a set of rows S$ 
of a matrix A, we denote by m5(A) the matrix whose rows are the projection of the rows of A to 
the span of S. 


Frieze et al. go further to show that there is an s x s submatrix for s = poly(k/e) from which the low-rank 
approximation can be computed in poly(k, 1/e) time in an implicit form. 


Theorem 2. Let S = S;U---US; be a random sample of rows of an m x n matrix A where 
for j =1,...,t, each set S; is a sample of s rows of A chosen independently from the following 
distribution: row i is picked with probability 
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where Ey = A, Ej = A—Tg,u..us;_,(A) and c is a constant. Then for s > k/ce, the span of S 
contains a matrix A, of rank k such that 


: 1 
Es((|4 — Aglle) < IIA - Aalle + ellAlle- 


The proof of Theorem [2] is given in Section [2] The resulting algorithm, described in Section 
uses 2¢ passes through the data and O(Mst + (m+ n)s?t?) computation time where M is the 
number of nonzeros in A. Although the sampling distribution is modified t times, the matrix itself 
is not changed and so its sparsity is maintained. The algorithm fits the streaming model in that 
the entries of A can arrive in any order (see Section 1.2}. The space used is O((m + n)kt/e). 

Theorem [2|implies that for any matrix A, there exists a subset of kt/e€ rows whose span contains 
a rank-k matrix whose error is within an additive ¢¢||A||} of the best rank-k matrix. Can this be 
improved? In particular, is there a small subset of rows whose span contains a rank-k matrix whose 
error is within a (1+) multiplicative factor of the error of the best possible rank-k approximation? 
Our next theorem answers this question affirmatively. 


Theorem 3. For any matrix A, there exists a subset of 4k2/e€ rows in whose span lies a rank-k 
matrix A; such that - 
|A — Anll < (1+ 6)|[A — Aall?- 


The proof of this theorem also uses iterative sampling, albeit in a “backwards” manner and 
yields a simple sampling-based algorithm for finding such an approximation only for k = 1. The 
proof for general k is by induction on k, and uses the sampling algorithm for k = 1 to extend 
the (approximately) best (k — 1)-dimensional subspace to an approximately best k-dimensional 
subspace. Although this existence result does not imply an algorithm faster than the SVD for 
finding such an approximation, it will be the key ingredient in our last result—a polynomial-time 
approximation scheme (PTAS) for the general projective clustering problem (j > 2). 

We restate the problem using the notation from computational geometry: Let d(p, F’) be the 
orthogonal distance of a point p to a subspace F. Given a set of n points P in R®, find a set of j 
k-dimensional subspaces F),...,F; such that 


C({F,... Fj}) = > min d(p, Fj)? 
pEeP 


is minimized. When subspaces are replaced by flats, the case k = 0 corresponds to the j-means 
problem (with the sum of squares objective function). 

Theorem |3] suggests an enumerative algorithm. The optimal set of k-dimensional subspaces 
induces a partition P),...,P; of the given point set. In each set P;, there is, by Theorem |3} a 
subset of size O(k?/e) in whose span lies a (1 + €) approximation to the optimal k-dimensional 
subspace for this set P;. So we consider all possible combinations of 7 subsets each of size O(k?/e), 


and a 6-net of k-dimensional subspaces in the span of each subset. The 6-net depends on the points 
in each subset and is not just a grid, as is often the case. Each possible combination of subspaces 
induces a partition and we simply output the best. Since the subset size is bounded (and so is the 
size of the net), this gives a PTAS for the problem (see Section [5). 


Theorem 4. Given n points in R¢ and parameters B and e, in time 


d @ O(ik? /e) 


we can find a solution to the projective clustering problem which is of cost at most (1+¢)B provided 
there is a solution of cost B. 


Our technique can also be viewed as an extension of the idea of core-sets. Roughly stated, 
a core-set is a small subset of the data which captures a near-optimal solution to the entire set. 
Theorem |3] states that there is a core-set of size O(k?/e) for minimizing the squared error to a 
rank k subspace. Unlike proofs of core-sets for other problems, our proof relies on the probabilistic 
method along with the properties of the SVD. Finally, our result can be extended to finding affine 
subspaces instead of linear subspaces. 


1.2. Related work 


Following the work of and which introduced matrix sampling for fast low-rank approxima- 
tion, Achlioptas and McSherry gave an alternative sampling-based algorithm for the problem. 
Their algorithm achieves similar bounds (see [1] for a detailed comparison) using only one pass. It 
does not seem amenable to the multipass improvements presented here. Subsequently, Bar- Yossef 
[9] has shown that the bounds of these algorithms for one or two passes are optimal up to polynomial 
factors in 1/e. 

These algorithms can also be viewed in the streaming model of computation [20]. In this model, 
we do not have random access to data; the data comes as a stream and we are allowed one or 
a few sequential passes over the data. Algorithms for the streaming model have been designed 
for computing frequency moments [7], histograms [17], etc. and have mainly focused on what 
can be done in one pass. There has been some recent work on what can be done in multiple 
passes [15]. The “pass-efficient” model of computation was introduced in [20]. Our multipass 
algorithm fits this model and investigates the tradeoff between approximation and the number of 
passes. Feigenbaum, et. al show such a tradeoff for computing the maximum unweighted 
matching in bipartite graphs. 

The results of our paper connect two previously separate fields — low-rank approximation and 
projective clustering. As mentioned earlier, projective clustering has been used in various contexts 
[6]. In [4], the authors consider the same problem as in this paper, and propose a variant of 
the j-means algorithm for it. Their paper has promising experimental results but does not provide 
any theoretical guarantees. There are theoretical results for special cases of projective clustering, 
especially the j-means problem (k = 0, find 7 0-dimensional affine subspaces, i.e., points). Drineas 
et al. gave a 2-approximation to j-means using SVD. Subsequently, Ostrovsky and Rabani 
gave the first randomized polynomial time approximation schemes for j-means (and also the 
j-median problem). Matousek and Effros and Schulman both gave deterministic PTAS’s 
for j-means. Fernandez de la Vega et al. describe a randomized algorithm with a running time 
of O(n(logn)?™).z Using the idea of core-sets, Har-Peled and Mazumdar showed a (1 + €) 


approximation algorithm that runs in linear time for fixed j,e. Kumar et al. give a linear- 
time PTAS that uses random sampling. There is a PTAS for k = 1 (lines) as well [2]. Other 
objective functions have also been studied, e.g. sum of distances (j-median when k = 0, [18] ) 
and maximum distance (j-center when k = 0, |8|). For general k, Har-Peled and Varadarajan 
give a (1 + €) approximation algorithm for the maximum distance objective. Their algorithm runs 
in time dnOG** les(1/e)/e) and is based on core-sets (see [8] for a survey). 


1.3. Notation and Preliminaries 


Let A € R™*”", Let AM denote the ith row of A, seen as a column vector. Any matrix accepts a 
singular value decomposition, that is, it can be written in the form 


A= > Pen 
i=1 


where r is the rank of A and 0; > 02 >--- > on > Oare called the singular values; {u“),...,u")} € 
R™, {y), = uir)} € R” are sets of orthonormal vectors, called the left and right singular vectors, 
respectively. It follows that A?u® = o;v and Av = o;u for 1<i<r. 

The Frobenius norm of a matrix A € R™*" having elements (a;;) is denoted ||A||~ and is given 
by 


It satisfies || A||?, = S7"_, 0? 


- 

For a subspace V C R”, let wy,4(A) denote the best rank-k approximation (under the Frobenius 
norm) of A with its rows in V. Let m,(A) = mpnx(A) = a ouOy@" be the best rank-k 
approximation of A. Also ty(A) = 7y,,(A) is the orthogonal projection of A onto V. When we say 
“a set (or sample) of rows of A” we mean a set of indices of rows, rather than the actual rows. For 
a set S of rows of A, let span(.S) C R” be the subspace generated those rows; we use the simplified 
notation mg(A) for Tspan(s)(A) and 79,4(A) for Tspan(s),h(A)- 

For subspaces V,W C R”, the sum of them is denoted V + W and is given by 


V+We={e+yeER”": xceViyew}. 
The following elementary properties of the operator wy will be used: 
e wy is linear, that is, ty(AA+B) = Any(A)+7y(B) for any \ € R and matrices A, B € R™*". 


e If V.W e€ R” are orthogonal linear subspaces, then my+w(A) = av(A) + mw(A), for any 
matrix A € R™*”. 


For a random vector v, its expectation, denoted E(V), is the vector having as components the 
expected values of the components of v. 


2 A random sample contains a good approximation 


We will prove Theorem|2|in this section. It will be convenient to formulate an intermediate theorem 
as follows. 
Theorem 5. Let A € R™*". Let V C R” be a vector subspace. Let E = A—my(A). For a fixed 
c ER, let S be a random sample of s rows of A from a distribution such that row i is chosen with 
probability 
(ES |? 
"EIR 


(1) 


Then, for any nonnegative integer k, 


k 
Es(l|A — Ty4span(s),b(A)llm) < A — mA) + og llElle- 


Proof. For S = (r;)%_, a sample of rows of A and 1 < j <r, let 


Then, Es(w)) = ty(A)Tu) + BPu® = oj;v%. Now we will bound Eg(||w — a;v||?). Use the 
definition of w) to get 


. . ice ud? ; 
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Apply the norm squared to each side and expand the left hand side: 


(i) (i) 12 uy? * all (ri). (eT T, ()|2 
Jw? — ajv97 | = Dis Ee i 2 (BT ul) + |B? u |. (2) 
Th $i Ti 
Observe that 
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which implies that 


Using this, apply Es to Equation to get: 


Bs([lw) — ojv||*) = Bg 


Now, from the left hand side, and expanding the norm squared, 
2 
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where 


ur BOP ‘i en Ph le? BOP 
yes (Me a). Spy cael 6) 


i=1 [=1 1=1 
and, using the independence of the r;’s and Equation (3). 
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The substitution of Equations (6) and in gives 


: 2 F ; 
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Using this in Equation (4) we have 
(3) (2) 1 
Re(lw — oy lu; coe EP yO 
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and, using the hypothesis for P; (Equation (1)), remembering that uw) is a unit vector and dis- 
carding the second term we conclude 


: : 1 
Eg((lw) — 030 |?) < —|Blp. (8) 


Let 9%) = zw) tor 9 = 125.3) k = min{k,r} wn of k’ as equal to k, this is the 


interesting case), let W = span{g™,..., g)}, and F=A ea )g®T_ We will bound the error 
|| A—my(A)||} using F’. Observe that a row space of F is ae in W and my is the projection 


operator onto the subspace of all matrices with row space in W with respect to the Frobenius norm. 
Thus, 


A — aw(A)|l# S || — Fille. (9) 
Moreover, 
|| A — A= (A- F)?ul |? = Io! a ae ee (10) 
i=l isk’ +1 


Taking expectation and using we get 


k 
Es(\|[A— Fllp) < 3 oF + * BIR = = ||A — (A) || + — Elle. 
i=k+1 


This and Equation (9) give 


k 
Es(l|A — tw(A)llz) < ||A - te(A) ll + SIIEIlF- (11) 


Finally, the fact that W C V + span(S) and dim(W) < k imply that 


A — ty4span(s),k(A)ll® < IA — tw(A)llz, 


and, combining this with Equation (11), we conclude 


k 
Es (A — ty4span(s),k(A) ll) < ||A — me(A) ll + —llElle- 


We can now prove Theorem [2] inductively using Theorem 


Proof. (of Theorem (2). We will prove the slightly stronger result 


: lay 
Es(\|A — m3,4(A)|l-) < ——S— 


k t 
A= m(AIE + (2) Alle 


by induction on t. The case t = 1 is precisely Theorem [1] 
For the inductive step, let E = A — ms,v...us,_,(A). By means of Theorem [5] we have that, 


k 
Bs, (||A — ™s,u..us,,e(A)ll) < ||A — me(A) II + ~g Elle 


Combining this inequality with the fact that ||E||} < ||A — 7,u...us;_,,4(A)||? we get 


k 
Es,([|A — ms,u--us.4(A)ll) < A — te(A) Il + IA — t5,0-usea4(A) ll 


Taking the expectation over S1,..., 54-1: 


ni} k y 
Es(l|A — msu-us.4(A) lle) < lA — te(A) lle + Esy,...5:4 (IA — 75,U-us,1 (Dll) 


and the result follows from the induction hypothesis for t — 1. 


3 Algorithm 


In this section, we present the multipass algorithm for low-rank approximation. We first describe 


it at a conceptual level and then give the details of the implementation. 


Informally, the algorithm will find an approximation to the best rank-k subspace (the span 
of v),...,u0) by first choosing a sample T of s random rows with density proportional to the 
squared norm of each row (as in Theorem (1). Then we focus ourselves on the space orthogonal to 
the span of the chosen rows, that is, we consider the matrix HF = A — (A), which represents in 
some sense the error of our current approximation, and we sample s additional rows with density 
proportional to the squared norm of the rows of E. We consider the union of this sample with our 
previous sample, and we continue adding samples in this way, up to the number of passes that we 


have chosen. Theorem 2] gives a bound on the error of this procedure. 


Fast SVD 


Input: A € R™*", integers k < m, t, error parameter € > 0. 
Output: hy ...,hg € R” such that with probability at least 3/4 their span V satisfies 


de 
l—-e 


A — ny (ADI < (1 + ) A — mg A)II2e + 4etI,All2- (12) 


1, Lek S=0) =e 
2. Repeat t times: 


(a) Let EB = A— (A). 


(b) Let T be a sample of s rows of A according to the distribution that assigns probability 
ZO |? . 
‘TENS to row 2. 


(¢) Let 6 =S UT. 


3. Let hi,...,h, be the top & right singular vectors of m5(A). 


Let M be the number of non-zeros of A. 


Theorem 6. Algorithm Fast SVD is correct and has running time O (ue +(m+n) ). 


Proof. For the correctness, observe that my(A) is a random variable with the same distribution as 
™s,h(A) as defined in Theorem [2| Also, || A — 7,4(A)||% — || A — 7% (A)||% is a nonnegative random 
variable and Theorem |2] gives a bound on its expectation: 


\ € 
Es(|A — t5,4(A)ll — IA — te(ADlle) < 7_<|4- mr(A)ll> + | All®- 


Markov’s inequality applied to this variable gives that with probability at least 3/4 


de 
l—-e 


A = my(A) 3s = [A — (ADI < IA — ma ADR + 4 AI. 
which implies inequality (12). 

We will now bound the running time. We maintain a basis of the rows indexed by S. In 
each iteration, we extend this basis orthogonally with a new set of vectors Y, so that it spans 
the new sample T. The residual squared length of each row, ||E||?, as well as the total, ||E||2, 
are computed by subtracting the contribution of mr(A) from the values that they had during the 
previous iteration. In each iteration, the projection onto Y needed for computing this contribution 
takes time O(Ms). In iteration i, the computation of the orthonormal basis Y takes time O(ns?%) 
(Gram-Schmidt orthonormalization of s vectors in R” against an orthonormal basis of size at most 
s(i+1)). Thus, the total time in iteration i is O(Ms+ns?i); with t iterations, this is O(Mst+ns?t?). 
At the end of Step [2] we have 7s(A) in terms of our basis (an m x st matrix). Finding the top 
k singular vectors in Step [3] takes time O(ms?t?). Bringing them back to the original basis takes 
time O(nkst). Thus, the total running time is O(Mst + ns?t? + mst? + nkst) or, in other words, 


O (Mi + (m+n)EF). 


4 Existence of a subset with multiplicative (1+) error 


In this section, we will prove Theorem [3} The first step is to show that for any matrix A, there is 
a row a such that the span of a is a factor 2 approximation to the best rank-1 subspace. 


Lemma 7. In any matriz A, there is a row a such that 
|| — tspan(a)(A) I S 2|.4 — m1 (A) |[B- 


Proof. As in the proof of Theorem [5] let b be (the index of) a random row of A picked according 

to the distribution that assigns probability P; = ||A©||?/||Al|% to row i. Define the random vector 
(1) 

— Ye 4) 

=— A 

w P, 


where u) is the top left singular vector of A. Then, we have E(w) = o,v and, by expanding 


2 
, we also have 


||w = ov) | 
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and, writing the expectation as a sum and using the definition of Py, 


m (1) 4(6) 1/2 
=(Soaip es) 6? 
|4OP 
b=1 
= AI — 0? 


= ||A—m(A)|lz. 


Therefore, as in Equations (9) and (10), 


E(||A — mepan(s)(A) I) SE (Jew — oro |?) + $9 0? < 2.4 — (ADI. 
1=2 


and hence there exists a row that proves the lemma. 


Proof. (of Theorem [3}) We will prove by induction on k that for integers s, k and A € R™”*” there 
exists a subset S of rows of A of size (s + 1)k such that 


2 k 
|A-nsa(A)leS (142) [A= m(A)IF. 


By Lemma [7| there is a row b of A such that 


lA = tepan(e)(A) ll < 2\|A — 71 (A)||F- 
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For k = 1, apply Theorem [5] with V = span (b); we get that there is a set S of s indices of rows 
of A such that 


1 
|A mug (Alle < |A-m(A)lle + 3llA- Tspan(b) (A)Il% 


s (142) [A-m(yii. 


/\ 


Now, to extend this to higher k, one might consider the approach of finding such a sample to 
approximate the best rank 1 subspace, projecting orthogonal to it and repeating. This does not 
work. The reason is that the error incurred by a higher rank approximation could be much smaller: 
an € multiplicative error at an earlier stage (e.g., for the first stage, multiplicative with respect to 
|| A — 71(A)||%, the best rank-1 approximation) is already too large (given that at the end we want 
a multiplicative error with respect to ||A — 7(A)||}, the best rank-k approximation). 

Instead, for proving the inductive step, we will use the sampling idea backwards. In our context, 
finding a rank-(k + 1) approximation to A is equivalent to finding a (k + 1)-dimensional subspace 
V onto which to project, so that my(A) is the approximation. If we think of our problem as finding 
such a subspace, then the quality of subspace V is given by ||A — my(A)||?. We want to find a 
(k + 1)-dimensional subspace in the span of (s + 1)(k + 1) rows of A that is worse than the best 
possible only by a multiplicative factor of (1+ 2/s)**+1.We will first show the existence of s+ 1 rows 
such that if we replace v) in the best rank-k + 1 subspace (generated by vo. ulk+I)) with a 
vector in the span of those rows, then the resulting & + 1-dimensional subspace is worse that the 
best by at most a (1 + 2/s) multiplicative factor. Then we will focus on our problem projected 
onto the orthogonal complement of span(b) to get from the inductive hypothesis that the best k- 
dimensional subspace it this restricted problem (which happens to be generated by v2), ulk+I)) 
can be replaced by another k-dimensional subspace in the span of (s+1)k rows, which is worse that 
the best by a multiplicative factor of at most (1+2/s)*. The combination of these two replacements 
will give the desired result. 

Suppose inductively that for some k and any matrix A and 1 < k’ < k, there exists a subset of 
rows of A of size (s + 1)k’ such that 


k’ 
A= nsw (A)e Ss (142) Am (ADIE: 


To prove this for k +1, let V be the optimal rank-(k + 1) subspace. Let V’ be the subspace of V 
orthogonal to v™). Project the rows of A orthogonal to V’ to get a matrix C, i.e., 


C= A- mTy(A). 


Applying the hypothesis with k’ = 1 to C, we get that there exists a vector b in the span of a set 
S” of s+ 1 rows of C such that 


2 2 
IC — mspanoy(C)II® S (1+ =) IC — m(C)IR = (14+ =} NO — topanomy (Cl = (13) 
f r (v)) 
Notice that V’ and span(b) are orthogonal subspaces, this implies that 
|| A — TV! 4+span(b) (A) \I% = ||A a my1(A) > Tspan(b) (A) Il F- 
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This combined with the fact that 7span(b)(A) = Tspan()(C) and the definition of C' gives 


|| A _ TV +span(b) (A) II = IC = Reoanthy(G) \|z- 
We can now apply Equation to get 


2 
JA myrsopandy(ADIF S (1+ 2) N= tapi (C) 


Use the definition of C' again and the fact that 7 (C) = Tepan(v)(A) to conclude 


span(v(1)) 


2 
A= mvrsapan(ADIPe S (1+ 2) BA av (A) ~ Mpa (ADI 
: (14) 
= (1+ 2) |4-av(AI- 
Now project A to the subspace orthogonal to b to get a matrix A’ = A — mpan(p)(A). Applying 


the inductive hypothesis to A’ and k’ = k, we get a set S’ of (s + 1)k rows such that (remember 
that V’ is k-dimensional) 


ae pa 
JA’ as a(R s (142) a! mans (142) a= n(n 5) 


The combination of the rows that we got from the two applications of the induction hypothesis 
gives us a set S = S’U S” of (s +1)(k +1) rows of A, and we will see now that this set proves the 
induction for k + 1. 

Note that Tspan(o)(A) — 75’,4(4’) is a matrix of rank at most k +1 whose row-space is contained 
in span (5). This implies 


|A = m5,641(A) Il < ||A — tepan(o)(A) — 19",4(A’)IF, 
using the definition of A’: 
= ||A’ — ts e( AIF: 


and using Equation (15): 


2 k 
< (142) [a'-m(A)IR 


iFrom this, the fact that my: as a projection satisfies || A’ — my/(A’)||?, < || A’ — Bl? for any matrix 
BeER™*” having row-space in V’ gives us: 


2 kb 
JA= rsa Allie s (142) al — ave Adi, 


using the definition of A’ again: 
9\k 
< (142) IA ~tpae)(4) = av (ADI, 
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using the orthogonality of span(b) and V’: 
k 
2 2 
= (142) A= Av-4pa0)(4)IE 
and using Equation (14), 


2 k+1 
s (142) 1A-av(ayie. 


This finishes the induction. The choice of s = 4k gives us the theorem. 


5 Application: projective clustering 


In this section, we give an approximation algorithm for the projective clustering problem described 
in Section [IT] The algorithm is motivated by Theorem [3] Let Vi,...,V; be the optimal subspaces 
partitioning the point set into P;,..., Pj, where P; is the subset of points closest to Vj. Theorem [3] 
states that there exists a subset P; C P; of size 4k?/e in whose span lies an approximately optimal 
k-dimensional subspace W;. We can enumerate over all combinations of j7 subsets, each of size 
4k? /e to find the P,, but we cannot enumerate the infinitely many k-dimensional subspaces lying 
in the span of P.. 

Thus, we cannot hope to find W, given P;. A natural approach to solve this problem would be 
to put a finite grid on the points in the span of P.. The hope would be that there are k grid points 
91,--+;9% Whose span G is “close” to Wj, since each basis vector for W; is close to a grid point. 
Although G may be “close” to W;, there may be a point p € P; for which d(p,G)? >> d(p, W;)?, 
i.e. G incurs a much greater error than W;. Indeed, consider a point p € P; whose distance to the 
origin is very large. Although the distance between a basis vector and a grid point might be small, 
the error induced by projecting a point onto the grid point will be proportional to its distance to 
the origin. 

The problem described above implies that a grid construction must be dependent on the point 
set P;. Our grid construction does depend on P;; we consider grid points in balls around a subset 
of P;. The grid is laid down in the span of P, (actually, a subspace of slightly higher dimension) to 
reduce the size of the grid, which is exponential in the dimension of the points. In Lemma [8} we 
show that there is a subspace F; that is the span of k grid points such that }7,cp, d(p, F;)? is not 
much worse than ype p, Up, W;)?. This construction avoids the problem that points far away from 
the origin will have error proportional to their distance to the origin, since there are grid points 
close to such points. Although we find F; in the span of P,, it can be brought back up to R%, where 
its error with respect to W; is the same as in P,. This is because W; lies in the span of PB. 

The algorithm is given below. 
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Algorithm Cluster 


Input: P C R®@, error parameter 0 < € <1, and B. 
Output: A set of 7 k-dimensional subspaces F) ... Fj, such that 


C({Fi...Fi}) =) min d(p, Fi)” 


pEP 


is at most (1 + €)B provided a solution of cost B exists. 


1. Set 6 = —Y8_, R= \/( + §)B + 26k. 


16jky/(1+§)n’ 


2. For each subset T of P of size 8jk?/e + jk: 


(a) For each equipartition of (7) ...7;) of T into 7 parts: 
i. Construct a d-net D; with radius R for each T; in the span of T;. 
ii. For each way of choosing j subspaces F),...,/;, where F; is the span of k 
points from D,: 
A. Compute the cost C({F) ... F;}). 


3. Report the subspaces fF... Fj; of minimum cost C({F,... F;}). 


In Step we construct a d-net D;. A 6-net D with radius R for a point set S is a set of 
points such that for any point q within distance R of some p € S, there exists a g € D such that 
d(q,g) < 6. To construct a d-net D for a point set S with radius R, we simply put a box of side 
length 2R around each point p € S. Each point at grid length 6/V/d in each box is in D, where d is 
the dimension of the points in S. The size of the d-net is thus exponential in the dimension of the 
points in S. This is why it is crucial that we constuct the 6-net for T; in the span of Tj; if we were 
to simply create a 6-net for T; in the original R@ space, the size of the net would be exponential in 
d! By constructing the d-net D; in the span of T; in Step|2(a)i} the size of D; is instead exponential 
in just 8k?/e +k, the number of points in 7}. 

The correctness of the algorithm relies crucially on the next lemma. 


Lemma 8. Let P be a point set, and let W be a subspace of dimension k such that 


» d(p, w) <a. 


pEP 
There exists a set C C P of k points such that there is a subspace F' with the following properties: 
1. F is the span of k points from a 6-net D with radius /a + 26k for C. 


2. F is not too far from W: 


S° d(p, F)? < S¢ d(p, W)? + 4k? nd? + 4kd S~ d(p, W). (16) 
pEeP pEeP pEP 
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Proof. We simultaneously construct F’ and choose the k points pj,...,pz, in C in k steps. Let 
Fo = W. Inductively, in step 7, we choose a point p; to put in C and rotate F;_1 so that it includes 
a grid point g; around p;. The subspace resulting from the last rotation, F;,, is the subspace F’ with 
the bound promised by the lemma. To prove that holds, we prove the following inequality for 
any point p € P going from F;_; to F; 


d(p, Fi) < d(p, Fi-1) + 26. (17) 


Summing over the k steps, squaring, and summing over n points, we have the desired result. 
Let Gi = {0}. G; will be the span of the grid points {g1,92,...,gi-1}. We describe how to 
construct the rotation R;. Let p; € P maximize 


II™e2 (TR. (P))II 


and let g; € D minimize 
(7 F,_, (Pi), 9)- 


Put p; in C. Consider the plane Z defined by 7¢1(gi),™e1(7F,_, (pi)), and 0. Let @ be the angle 
between 71(gi) and moi (7F,_,(pi)). Let Rj be the rotation in the plane Z by the angle 0, and 
define F; = RF 1. Set Cx = G; + span{g;}. 
Now we prove inequality (17). We do so by proving the following inequality by induction on i 
for any point p: 
d(rr,_.(p), Ritr,_.(0)) < 28. (18) 


Note that this proves by applying the triangle inequality, since: 


Up, Fi) < d(p, Fi-1) + d(tr,_,(p), TF, (P)) 
< d(p, Pia) + d(mr,_,(p), Ritr,_,(p))- 


The base case of the inequality, 7 = 1, is trivial. Consider the inductive case; here, we are bounding 
the distance between 7p,_,(p) and Riwp,_,(p). It suffices to bound the distance between these two 
points in the subspace orthogonal to G;, since the rotation R; is chosen orthogonal to G;. That is, 


d(tr,_1(p), Rites (p)) < det (tH (P)), Rites (TH (P)))- 


Now, consider the distance between a point toi (mm7_,(p)) and its rotation, Ritoi(tm_,(p)). This 
distance is maximized when ||7¢.(7r,_,(p))|| is maximized, so we have, by construction, that the 
maximum value is achieved by p;: 


U(r gs (™H,_, (P)), Rimes (THs (P))) < dmg (THs (Pi), RimGs (™H_, (Pi)))- 
By the triangle inequality we have: 
d(mGs (tr (pi), Rites (TR, (Pi) S Amat (TH. (Pi), Taz (91) + UT eE2 (G1), Rites (TR; (Pi)))- 
To bound the first term, d(t¢1(7R,_,(pi)), Tet (gi)), note that 


dmg (tr, (Pi), Maz (Gi)) S UmR,_, (Pi); 9). 
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We show that 77,_,(p;) is within a ball of radius R around p;; this implies 


d(tF,_, (pi), 91) < 6 (19) 


by construction of the 6-net around p;. We have: 


2 
d(pi, TF,_, (pi) a d(pi, Fo) “Lae i) TWFi41(Pi)) 
< Vat s (mr, (pi), Rj+17F; (pi) 
j=l 
< Va+25k=R 


The third line uses the induction hypothesis. 

Now we bound the second term, dmg (gi), Rings (wr,_,(p:))). Note that Rites (tRr_, (pi))) 
is just a rescaling of Tos (G) and that Iron (mm (p*))|| = ||Ritet (tRr_, (p*))|I, since rotation 
preserves norms. The bound on the first term implies that lures (Cll = laren (ar,_,(p*))|| — 6, so 


d(ngs (gi), Rings (mr,_.(Pi))) <6. (20) 


Combining and (20), we have proved (18). 


Now, we are ready to prove Theorem |4| which proves the correctness of the algorithm. 


Proof. Assume that the optimal solution is of value at most B. Let Vj,...,Vj; be the optimal 
subspaces, and let P),...,P; be the partition of P such that P; is the set af points closest to Vj. 
Let nj = |P|, with S0,n; =n. By Theorem {3} 3| there exists a subset S; of P; of size at most 8k?/e 
such that there is a subspace W; in the span of S; with 


> de, < (1+ 5) devi). (21) 
pEP; pEP; 


Consider each subset P; and its corresponding subspace W;. Now, apply Lemma [8] to P,; and W; 
using R,6 as in the algorithm. Since the optimal solution is of value at most B, we have that, for 


every 2: 
S> d(p, Wi)? < (1 er =) B. 
peP, 


Let C;, F; be the set of k points and subspace, respectively, promised by the lemma. F; is the span 
of k points from a 6-net of C; and obeys the following inequality: 


S— d(p, Fi)? < ( 5) >. Up, Vi) oof get Gg? 
pEeP; pEeP; 


Let T = Uj; $iUC;. We have that |T| = 8jk?/e+jk. The algorithm will enumerate some set T’ D T 
in Step 2} Furthermore, it will consider a partition {Tj ...Tj} such that T/ > $; UC; in Step al 
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Let D‘ be the 6-net for T/ projected to the span of T/. Since the algorithm enumerates over 
all subspaces lying in the span of D/, the algorithm will consider the subspaces F\,..., Fj; whose 
existence is proven above. The cost associated with F),..., Fj is: 


ij 
C{FiveE}). = 3) Ss a@ey 


i=l pEP,; 
€ j € € 
-\2 
& (+5) 30 dew) Sree ea 
i=1 pEeP; 
< (1+6)B. 


The running time analysis follows by the following bounds. The number of subsets of size 
8jk?/e+ jk is at most n8sk°/e+3k The number of equipartitions of a set of size 8jk?/e+ jk into j 
parts is at most 7° k/e+Jk Recall that the d-net D for a point set T of dimension d is implemented 
by putting a box with side length 2R of grid width 6/Vd around each point in T. Let X be the set 
of grid points in the box around a point p. The number of subspaces in each d-net D; is therefore at 
most ((8k?/e + k) \x|)*, so the number of j subspaces that one can choose for a partition (7) ...Tj) 
is ((8k? Jet k) |X l)’ * The computation of projecting points, finding a basis, and determining the 
cost of a candidate family of subspaces takes time O(ndjk). The cardinality of X is (remember 


that € < 1): 
8k? /e+k 8k? /etk 
2R n 
X| = | ——____. <|O ie" ) : 
a (ae) ( (iz) 


Therefore, the running time of the algorithm is at most 


: : é é ; O(k33/e€ 
O(ndjk) ni leak IK /e+7k (8K Je +k) |X|)" = d (=) wie 
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